
261 Hours For A 300M Model And I Have Every Optimization

I have every optimization under the sun enabled. Native NVFP4 quantization. Torch.compile with max auto tune and cudagraphs. No gradient accumulation. Maximum batch size. My GPU is locked at 600W. My clocks are fixed. My cooling is liquid. Everything is perfect.

It is still taking 261 hours to train TMLM-Sonnet. That is 300 million parameters. Not Opus. Not some massive model. Sonnet. The middle child. The one that should be quick. The one that should not require me to plan my life around training runs.

I have optimized everything except the one thing that matters. Time.

The Optimization Stack

Here is what I am running. Every trick I know. Every flag I could find. Every optimization that promised speed.

  • Native NVFP4 Quantization ENABLED
  • Torch.compile Max Auto Tune ENABLED
  • CUDAGraphs ENABLED
  • Gradient Accumulation DISABLED
  • Batch Size MAXED
  • GPU Clocks LOCKED
  • Power Limit 600W
  • Memory Transfer Rate 34 Gbps

Everything is green. Everything is enabled. Everything should be fast. Nothing is fast.
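
For concreteness, the software side of that list boils down to a couple of lines. This is a sketch, not my actual training script; the model below is a stand-in for TMLM-Sonnet, and the nvidia-smi commands in the comments are the standard power and clock locks.

# Sketch of the software side of the list above; the model here is a stand-in.
# The hardware side lives outside Python: nvidia-smi -pl 600 caps power,
# nvidia-smi -lgc locks the core clocks.
import torch
from torch import nn

model = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True).cuda()
model = torch.compile(model, mode="max-autotune")  # max autotune; this mode also turns on CUDA graphs
# NVFP4 quantization is enabled separately; that API is experimental, so it is not shown here.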

The Numbers

  • Sonnet: 300M parameters, 261 hours
  • Opus: 600M parameters, ~550 hours (estimated)

Two hundred sixty-one hours. That is almost eleven days. For a 300 million parameter model. In 2026. With a 5090. With every optimization. I could train a 7B model faster on a cloud cluster in 2024.

The Math Does Not Math

I did the calculations. With NVFP4 I should get 4x memory efficiency. With torch.compile I should get 1.5x to 2x speedup. With cudagraphs I should reduce kernel launch overhead. With max batch size I should maximize throughput.

The math says I should be done in 50 hours. The reality says 261 hours. There is a gap. The gap is my life.

# Expected vs Actual training time
Expected: 50 hours
Actual: 261 hours
Difference: 211 hours
# Those 211 hours are my soul

What I Have Tried

I have profiled everything. PyTorch profiler. Nsight Systems. Nsight Compute. I have graphs. I have timelines. I have flame graphs that look like modern art. I know exactly where time is spent. I do not know how to fix it.
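
If you want to generate your own modern art, the PyTorch profiler loop looks roughly like this. A sketch: train_step and loader stand in for your own training step and data loader.

# Minimal torch.profiler sketch; train_step and loader are placeholders.
import torch
from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=5),
    on_trace_ready=tensorboard_trace_handler("./profile_logs"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)        # placeholder: forward, backward, optimizer step
        prof.step()
        if step + 1 >= 8:        # one full cycle: 1 wait + 2 warmup + 5 active
            break

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))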

Data loading is not the bottleneck. I preprocessed everything. I use mmap. I use pinned memory. I use multiple workers. The data is ready. The GPU is waiting. For what? I do not know.
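
For reference, the data path is the boring, standard one. A sketch with a made-up file name, sequence length, and batch size:

# Memory-mapped token file plus a pinned-memory DataLoader (hypothetical path and sizes).
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MmapTokenDataset(Dataset):
    def __init__(self, path, seq_len):
        self.tokens = np.memmap(path, dtype=np.uint16, mode="r")  # pre-tokenized corpus
        self.seq_len = seq_len

    def __len__(self):
        return (len(self.tokens) - 1) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        chunk = self.tokens[start : start + self.seq_len + 1].astype(np.int64)
        return torch.from_numpy(chunk[:-1]), torch.from_numpy(chunk[1:])

loader = DataLoader(
    MmapTokenDataset("tokens.bin", seq_len=2048),  # hypothetical file and length
    batch_size=64,             # hypothetical; whatever fills VRAM
    num_workers=8,
    pin_memory=True,           # pinned host memory so H2D copies overlap compute
    persistent_workers=True,
)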

Memory is not the bottleneck. I have 32 GB of VRAM. The model uses 12 GB. The rest is free. I could fit more. I could batch more. I cannot go faster.

The Torch.compile Situation

Torch.compile with max auto tune takes forever to compile. The first run is slow. The second run is slower. The third run crashes. The fourth run works. Then I change one line of code and it compiles again. For an hour.

Cudagraphs help. Sometimes. When they work. When they do not work they give me cryptic errors about tensor shapes that do not make sense. I spend more time debugging cudagraphs than I save from using them.
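
If you hit the same cryptic shape errors, torch.compile does ship a mode that keeps the autotuning and skips the CUDA graphs. A sketch, with model standing in for yours:

# The two compile settings I bounce between; model is a placeholder.
import torch

with_graphs    = torch.compile(model, mode="max-autotune")                # full autotune + CUDA graphs
without_graphs = torch.compile(model, mode="max-autotune-no-cudagraphs")  # same autotuning, no CUDA graphs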

Optimization is just debugging with extra steps and more disappointment.

The NVFP4 Reality

Native NVFP4 is new. It is experimental. It is supposed to be fast. It is fast when it works. It crashes when it does not work. The error messages are unhelpful. The documentation is sparse. The community is small.

I enabled it. It works. Mostly. Sometimes. I think. The speedup is real but not as big as advertised. Maybe I am doing it wrong. Maybe the documentation is wrong. Maybe both.

Opus Is A Marathon

Sonnet is 300M parameters. Opus is 600M. Double the parameters. Double the compute. If Sonnet takes 261 hours, Opus takes roughly 550 hours. That is 23 days. That is nearly a month of continuous training on a single GPU.

I planned to release Haiku, Sonnet, and Opus together. Haiku is done. Sonnet is training. Opus is a distant horizon. A beautiful dream. A dream that will not be real until next month at the earliest.

And that is assuming nothing crashes. Assuming the power does not go out. Assuming my liquid cooler does not develop a leak. Assuming I do not lose my mind watching a progress bar for 23 days.

Why I Keep Going

I could use cloud GPUs. I could rent a cluster. I could finish in days instead of weeks. I do not have the money. I have a 5090 and stubbornness. The 5090 is fast. The stubbornness is faster.

I could use smaller models. I could train Haiku variants. I could be practical. I am not practical. I have never been practical. Practical people do not buy liquid cooled GPUs.

I could stop. I will not stop. The model will train. The loss will go down. Eventually. Maybe. Probably.

The Silver Lining

At least the training is stable. My locked clocks mean consistent iteration times. I know exactly when each checkpoint will save. I know exactly when I can sleep. I know exactly how long I have to wait.
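
That certainty is just arithmetic. With a steady step time, the ETA is one multiplication; every number below is made up:

# ETA under locked clocks; all of these numbers are hypothetical.
sec_per_step = 5.0          # steady-state iteration time
steps_total  = 150_000
steps_done   = 30_000
hours_left = (steps_total - steps_done) * sec_per_step / 3600
print(f"{hours_left:.0f} hours to go")  # certainty, of the depressing kind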

Knowing you have to wait 261 hours is different from wondering if it will finish. Uncertainty is painful. Certainty is just depressing. I prefer depressing.

Final Thoughts

Every optimization is enabled. Everything should be fast. Nothing is fast. I am tired. My GPU is hot. My electricity bill is a war crime. The model is training. Slowly. Painfully. But it is training.

If you are thinking about training models locally, do it. It is fun. It is educational. It is free if you ignore the hardware costs. It is also slow. So slow. You will question every decision that led you here.

Do it anyway. The loss curve going down is worth the wait. Even if the wait is 261 hours. Even if it is 550 hours. Even if it is forever.